Towards Structural Classification of Proteins based on Contact Map Overlap
A multitude of measures have been proposed to quantify the similarity between
protein 3-D structures. Among these measures, contact map overlap (CMO)
maximization has received sustained attention during the past decade because it
offers a fine estimate of the natural homology relation between proteins. Despite
this large involvement of the bioinformatics and computer science community,
the performance of known algorithms remains modest. Due to the complexity of
the problem, they stall on relatively small instances and are not
applicable to large-scale comparison. This paper offers a clear improvement
over past methods in this respect. We present a new integer programming model
for CMO and propose an exact branch-and-bound (B&B) algorithm with bounds
computed by solving a Lagrangian relaxation. The efficiency of the approach is demonstrated on a
popular small benchmark (Skolnick set, 40 domains). On this set our algorithm
significantly outperforms the best existing exact algorithms, and yet provides
lower and upper bounds of better quality. Some hard CMO instances have been
solved for the first time and within reasonable time limits. From the values of
the running time and the relative gap (relative difference between upper and
lower bounds), we obtained the right classification for this test. These
encouraging results led us to design a harder benchmark to better assess the
classification capability of our approach. We constructed a large-scale set of
300 protein domains (a subset of the ASTRAL database) that we have called Proteus
300. Using the relative gap of each of the 44850 pairs as a similarity
measure, we obtained a classification in very good agreement with SCOP. Our
algorithm thus provides a powerful classification tool for large structure
databases.
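The quantities named in this abstract can be illustrated with a minimal sketch. It only evaluates the contact map overlap of a given order-preserving alignment; it does not search for the best alignment and is not the paper's B&B or Lagrangian method. The function names are hypothetical, and normalizing the relative gap by the upper bound is an assumption, since the abstract only says "relative difference between upper and lower bounds".

```python
from math import comb


def contact_map_overlap(contacts_a, contacts_b, alignment):
    """Count contacts of protein A that land on contacts of protein B.

    contacts_a, contacts_b: iterables of (i, k) residue-index pairs in contact.
    alignment: dict mapping residue indices of A to residue indices of B;
               it must be order-preserving (strictly increasing).
    """
    pairs = sorted(alignment.items())
    for (_, j1), (_, j2) in zip(pairs, pairs[1:]):
        if j2 <= j1:
            raise ValueError("alignment must be order-preserving")
    # Contacts are undirected, so store B's contacts as unordered pairs.
    b = {frozenset(c) for c in contacts_b}
    return sum(
        1
        for i, k in contacts_a
        if i in alignment
        and k in alignment
        and frozenset((alignment[i], alignment[k])) in b
    )


def relative_gap(upper, lower):
    """Relative difference between bounds (assumed normalization: upper bound)."""
    return (upper - lower) / upper if upper else 0.0


# Number of unordered pairs among 300 domains, as quoted in the abstract.
n_pairs = comb(300, 2)
```

For the 300-domain benchmark, `comb(300, 2)` reproduces the 44850 pairs mentioned above.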
Solving Maximum Clique Problem for Protein Structure Similarity
A basic assumption of molecular biology is that proteins sharing close
three-dimensional (3D) structures are likely to share a common function and in
most cases derive from a common ancestor. Computing the similarity between two
protein structures is therefore a crucial task and has been extensively
investigated. Evaluating the similarity of two proteins can be done by finding
an optimal one-to-one matching between their components, which is equivalent to
identifying a maximum weighted clique in a specific "alignment graph". In this
paper we present a new integer programming formulation for solving such clique
problems. The model has been implemented using the ILOG CPLEX Callable Library.
In addition, we designed a dedicated branch and bound algorithm for solving the
maximum cardinality clique problem. Both approaches have been integrated in
VAST (Vector Alignment Search Tool), a software tool for aligning protein 3D
structures that is widely used at NCBI (National Center for Biotechnology
Information). The original VAST clique solver uses the well-known Bron and
Kerbosch algorithm (BK). Our computational results on real-life protein
alignment instances show that our branch and bound algorithm is up to 116 times
faster than BK for the largest proteins.
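For orientation, the baseline mentioned above can be sketched as follows. This is the textbook Bron and Kerbosch enumeration with pivoting, written for a plain adjacency dictionary. It is neither VAST's implementation nor the branch and bound algorithm proposed in the paper, and the function names are hypothetical.

```python
def bron_kerbosch(adj):
    """Yield every maximal clique of an undirected graph.

    adj: dict mapping each node to the set of its neighbours.
    """
    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored nodes.
        if not p and not x:
            yield set(r)
            return
        # Pivot on the node with most neighbours among the candidates,
        # so that fewer branches need to be explored.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def maximum_clique(adj):
    """Return one maximum-cardinality clique by scanning all maximal cliques."""
    return max(bron_kerbosch(adj), key=len, default=set())
```

Enumerating all maximal cliques and keeping the largest is simple but can be slow on large alignment graphs, which is the kind of cost a dedicated bounded search aims to avoid.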
Functional geometry of protein interactomes
Motivation
Protein–protein interactions (PPIs) are usually modeled as networks. These networks have been extensively studied using graphlets, small induced subgraphs that capture the local wiring patterns around nodes in networks. These studies revealed that proteins involved in similar functions tend to be similarly wired. However, such simple models can only represent pairwise relationships and cannot fully capture the higher-order organization of protein interactomes, including protein complexes.
Results
To model the multi-scale organization of these complex biological systems, we utilize simplicial complexes from computational geometry. The question is how to mine these new representations of protein interactomes to reveal additional biological information. To address this, we define simplets, a generalization of graphlets to simplicial complexes. By using simplets, we define a sensitive measure of similarity between simplicial complex representations that allows for clustering them according to their data types better than clustering them by using other state-of-the-art measures, e.g. spectral distance, or facet distribution distance. We model human and baker’s yeast protein interactomes as simplicial complexes that capture PPIs and protein complexes as simplices. On these models, we show that our newly introduced simplet-based methods cluster proteins by function better than the clustering methods that use the standard PPI networks, uncovering the new underlying functional organization of the cell. 
We demonstrate the existence of the functional geometry in the protein interactome data and the superiority of our simplet-based methods to effectively mine for new biological information hidden in the complexity of the higher-order organization of protein interactomes. This work was supported by the European Research Council (ERC) Starting Independent Researcher Grant 278212, the European Research Council (ERC) Consolidator Grant 770827, the Serbian Ministry of Education and Science Project III44006, the Slovenian Research Agency project J1-8155 and the awards to establish the Farr Institute of Health Informatics Research, London, from the Medical Research Council, Arthritis Research UK, British Heart Foundation, Cancer Research UK, Chief Scientist Office, Economic and Social Research Council, Engineering and Physical Sciences Research Council, National Institute for Health Research, National Institute for Social Care and Health Research, and Wellcome Trust (grant MR/K006584/1).
Shape Matching by Localized Calculations of Quasi-isometric Subsets, with Applications to the Comparison of Protein Binding Patches
Given a protein complex involving two partners, the receptor and the ligand, this paper addresses the problem of comparing their binding patches, i.e. the sets of atoms accounting for their interaction. This problem has been classically addressed by searching quasi-isometric subsets of atoms within the patches, a task equivalent to a maximum clique problem, an NP-hard problem, so that practical binding patches involving up to 300 atoms cannot be handled. We extend previous work in two directions. First, we present a generic encoding of shapes represented as cell complexes. We partition a shape into concentric shells, based on the shelling order of the cells of the complex. The shelling order yields a shelling tree encoding the geometry and the topology of the shape. Second, for the particular case of cell complexes representing protein binding patches, we present three novel shape comparison algorithms. These algorithms combine a Tree Edit Distance (TED) calculation on shelling trees with edit operations favoring either a topological or a geometric comparison of the patches. We show in particular that the geometric TED calculation strikes a balance, in terms of accuracy and running time, between purely geometric and purely topological comparisons, and we briefly comment on the biological findings reported in a companion paper.
Comparing Protein 3D Structures Using A_purva
Structural similarity between proteins provides significant insights about their functions. Contact Map Overlap (CMO) maximization has received sustained attention during the past decade and can be considered today a credible protein structure similarity measure. We present here A_purva, an exact CMO solver that is both efficient (notably faster than the previous exact algorithms) and reliable (providing accurate upper and lower bounds of the solution). These properties make it applicable for large-scale protein comparison and classification. Availability: http://apurva.genouest.org Contact: [email protected] Supplementary information: A_purva's user manual, as well as many examples of protein contact maps, can be found on A_purva's web page.
Identifying cellular cancer mechanisms through pathway-driven data integration
Abstract
Motivation
Cancer is a genetic disease in which accumulated mutations of driver genes induce a functional reorganization of the cell by reprogramming cellular pathways. Current approaches identify cancer pathways as those most internally perturbed by gene expression changes. However, driver genes characteristically perform hub roles between pathways. Therefore, we hypothesize that cancer pathways should be identified by changes in their pathway–pathway relationships.
Results
To learn an embedding space that captures the relationships between pathways in a healthy cell, we propose pathway-driven non-negative matrix tri-factorization. In this space, we determine condition-specific (i.e. diseased and healthy) embeddings of pathways and genes. Based on these embeddings, we define our ‘NMTF centrality’ to measure a pathway’s or gene’s functional importance, and our ‘moving distance’, to measure the change in its functional relationships. We combine both measures to predict 15 genes and pathways involved in four major cancers, predicting 60 gene–cancer associations in total, covering 28 unique genes. To further exploit driver genes’ tendency to perform hub roles, we model our network data using graphlet adjacency, which considers nodes adjacent if their interaction patterns form specific shapes (e.g. paths or triangles). We find that the predicted genes rewire pathway–pathway interactions in the immune system and provide literature evidence that many are druggable (15/28) and implicated in the associated cancers (47/60). We predict six druggable cancer-specific drug targets. This work was supported by the European Research Council (ERC) Consolidator Grant 770827 and the Spanish State Research Agency AEI 10.13039/501100011033 [grant number PID2019-105500GB-I00].
Graphlet eigencentralities capture novel central roles of genes in pathways
Motivation
Graphlet adjacency extends regular node adjacency in a network by considering a pair of nodes being adjacent if they participate in a given graphlet (small, connected, induced subgraph). Graphlet adjacencies captured by different graphlets were shown to contain complementary biological functions and cancer mechanisms. To further investigate the relationships between the topological features of genes participating in molecular networks, as captured by graphlet adjacencies, and their biological functions, we build more descriptive pathway-based approaches.
Contribution
We introduce a new graphlet-based definition of eigencentrality of genes in a pathway, graphlet eigencentrality, to identify pathways and cancer mechanisms described by a given graphlet adjacency. We compute the centrality of genes in a pathway either from the local perspective of the pathway or from the global perspective of the entire network.
Results
We show that in molecular networks of human and yeast, different local graphlet adjacencies describe different pathways (i.e., all the genes that are functionally important in a pathway are also considered topologically important by their local graphlet eigencentrality). Pathways described by the same graphlet adjacency are functionally similar, suggesting that each graphlet adjacency captures different pathway topology and function relationships. Additionally, we show that different graphlet eigencentralities describe different cancer driver genes that play central roles in pathways, or in the crosstalk between them (i.e. we can predict cancer driver genes participating in a pathway by their local or global graphlet eigencentrality). This result suggests that by considering different graphlet eigencentralities, we can capture different functional roles of genes in and between pathways. This study received support from the following sources: The European Research Council (ERC) Consolidator Grant 770827 (awarded to NP); The Spanish State Research Agency AEI 10.13039/501100011033 grant number PID2019-105500GB-I00 (awarded to NP); and University College London Computer Science (awarded to SW). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
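The idea of graphlet adjacency and an eigencentrality built on it can be illustrated for one graphlet, the triangle. The sketch below is a simplified illustration under stated assumptions and not the authors' definition: two nodes are treated as triangle-adjacent with a weight equal to the number of triangles they share, and centrality is taken as the leading eigenvector of that weighted adjacency, approximated by power iteration. All function names are hypothetical.

```python
def triangle_adjacency(adj):
    """Weighted adjacency for the triangle graphlet.

    adj: dict mapping each node to the set of its neighbours.
    Returns dict u -> {v: number of triangles containing both u and v}.
    """
    return {
        u: {v: len(adj[u] & adj[v]) for v in adj[u] if adj[u] & adj[v]}
        for u in adj
    }


def eigencentrality(weights, iterations=200):
    """Approximate the leading eigenvector of a symmetric weighted adjacency.

    Power iteration is run on (W + I); the identity shift avoids the
    oscillation plain power iteration can show, without changing eigenvectors.
    Scores are scaled so that the largest one equals 1.
    """
    if not any(weights.values()):
        return {u: 0.0 for u in weights}
    score = {u: 1.0 for u in weights}
    for _ in range(iterations):
        new = {
            u: score[u] + sum(w * score[v] for v, w in weights[u].items())
            for u in weights
        }
        norm = max(new.values())
        score = {u: s / norm for u, s in new.items()}
    return score
```

On a triangle with one pendant node attached, the three triangle nodes receive equal scores while the pendant node, which touches no triangle, tends to zero.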
Characterizing the Morphology of Protein Binding Patches
Let the patch of a partner in a protein complex be the collection of atoms accounting for the interaction. To improve our understanding of the structure-function relationship, we present a patch model decoupling the topological and geometric properties. While the geometry is classically encoded by the atomic positions, the topology is recorded in a graph encoding the relative position of concentric shells partitioning the interface atoms. The topological-geometric duality provides the basis of a generic dynamic programming-based algorithm comparing patches at the shell level, which may favor topological or geometric features. On the biological side, we address four questions, using 249 cocrystallized heterodimers organized in biological families. First, we dissect the morphology of binding patches and show that Nature exploited the topological and geometric degrees of freedom independently while retaining a finite set of qualitatively distinct topological signatures. Second, we argue that our shell-based comparison is effective for atomic-level comparisons and show that topological similarity is less stringent than geometric similarity. We also use the topological versus geometric duality to exhibit topo-rigid patches, whose topology (but not geometry) remains stable upon docking. Third, we use our comparison algorithms to infer specificity-related information amidst a database of complexes. Finally, we exhibit a descriptor outperforming its contenders to predict the binding affinities of the affinity benchmark. The software developed with this article is available from http://team.inria.fr/abs/vorpatch_compatch/
A functional analysis of omic network embedding spaces reveals key altered functions in cancer
Abstract
Motivation
Advances in omics technologies have revolutionized cancer research by producing massive datasets. A common approach to deciphering these complex data is to apply embedding algorithms to molecular interaction networks. These algorithms find a low-dimensional space in which similarities between the network nodes are best preserved. Currently available embedding approaches mine the gene embeddings directly to uncover new cancer-related knowledge. However, these gene-centric approaches produce incomplete knowledge, since they do not account for the functional implications of genomic alterations. We propose a new, function-centric perspective and approach, to complement the knowledge obtained from omic data.
Results
We introduce our Functional Mapping Matrix (FMM) to explore the functional organization of different tissue-specific and species-specific embedding spaces generated by a Non-negative Matrix Tri-Factorization algorithm. Also, we use our FMM to define the optimal dimensionality of these molecular interaction network embedding spaces. For this optimal dimensionality, we compare the FMMs of the most prevalent cancers in human to FMMs of their corresponding control tissues. We find that cancer alters the positions in the embedding space of cancer-related functions, while it keeps the positions of the noncancer-related ones. We exploit this spatial ‘movement’ to predict novel cancer-related functions. Finally, we predict novel cancer-related genes that the currently available methods for gene-centric analyses cannot identify; we validate these predictions by literature curation and retrospective analyses of patient survival data. This project has received funding from the European Research Council (ERC) Consolidator Grant 770827 and the Spanish State Research Agency AEI 10.13039/501100011033 grant number PID2019-105500GB-I00.